Listen2Scene: Interactive material-aware binaural sound propagation for reconstructed 3D scenes
We present an end-to-end binaural audio rendering approach (Listen2Scene) for
virtual reality (VR) and augmented reality (AR) applications. We propose a
novel neural-network-based binaural sound propagation method to generate
acoustic effects for 3D models of real environments. Any clean audio or dry
audio can be convolved with the generated acoustic effects to render audio
corresponding to the real environment. We propose a graph neural network that
uses both the material and the topology information of the 3D scenes and
generates a scene latent vector. Moreover, we use a conditional generative
adversarial network (CGAN) to generate acoustic effects from the scene latent
vector. Our network is able to handle holes or other artifacts in the
reconstructed 3D mesh model. We present an efficient cost function for the
generator network to incorporate spatial audio effects. Given the source and
the listener position, our learning-based binaural sound propagation approach
can generate an acoustic effect in 0.1 milliseconds on an NVIDIA GeForce RTX
2080 Ti GPU and can easily handle multiple sources. We have evaluated the
accuracy of our approach with binaural acoustic effects generated using an
interactive geometric sound propagation algorithm and captured real acoustic
effects. We also performed a perceptual evaluation and observed that the audio
rendered by our approach is more plausible than audio rendered using prior
learning-based sound propagation algorithms.
Comment: Project page: https://anton-jeran.github.io/Listen2Scene
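As a concrete illustration of the rendering step described above (convolving dry audio with the generated acoustic effects), here is a minimal sketch that assumes the binaural impulse response has already been produced by the network; the helper and placeholder arrays below are illustrative, not the paper's code:

```python
# Render dry (anechoic) audio through a two-channel binaural impulse
# response, as described in the abstract. The IR here is a placeholder
# standing in for the Listen2Scene network's output.
import numpy as np
from scipy.signal import fftconvolve

def render_binaural(dry: np.ndarray, binaural_ir: np.ndarray) -> np.ndarray:
    """dry: (num_samples,) mono signal; binaural_ir: (2, ir_len)."""
    left = fftconvolve(dry, binaural_ir[0])
    right = fftconvolve(dry, binaural_ir[1])
    out = np.stack([left, right])               # (2, num_samples + ir_len - 1)
    return out / (np.max(np.abs(out)) + 1e-9)   # simple peak normalization

dry = np.random.randn(16000)                    # 1 s of audio at 16 kHz
ir = np.random.randn(2, 4800) * np.exp(-np.linspace(0.0, 8.0, 4800))
stereo = render_binaural(dry, ir)               # binaural output
```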
M3-AUDIODEC: Multi-channel multi-speaker multi-spatial audio codec
We introduce M3-AUDIODEC, an innovative neural spatial audio codec designed
for efficient compression of multi-channel (binaural) speech in both single and
multi-speaker scenarios, while retaining the spatial location information of
each speaker. This model boasts versatility, allowing configuration and
training tailored to a predetermined set of multi-channel, multi-speaker, and
multi-spatial overlapping speech conditions. Key contributions are as follows:
1) we extend previous neural codecs from single-channel to multi-channel audio;
2) our model can compress and decode overlapping speech; 3) a groundbreaking
architecture compresses speech content and spatial cues separately, ensuring
that each speaker's spatial context is preserved after decoding; 4) M3-AUDIODEC
reduces the bandwidth for compressing two-channel speech by 48% compared to
compressing each binaural channel individually. Operating at 12.6 kbps, it
outperforms Opus at 24 kbps and AUDIODEC at 24 kbps by 37% and 52%,
respectively. In our
assessment, we employed speech enhancement and room acoustic metrics to
ascertain the accuracy of clean speech and spatial cue estimates from
M3-AUDIODEC. Audio demonstrations and source code are available online at
https://github.com/anton-jeran/MULTI-AUDIODEC.
Comment: More results and source code are available at https://anton-jeran.github.io/MAD
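To make contribution 3) concrete, here is a minimal, hypothetical sketch of encoding speech content and spatial cues along separate branches and combining them at decode time; all module names, layer choices, and sizes are illustrative assumptions, not the paper's actual architecture:

```python
# Toy split-codec: a content branch sees a mono downmix ("what is said"),
# a spatial branch sees all channels ("where it comes from"), and the
# decoder reconstructs the binaural waveform from both.
import torch
import torch.nn as nn

class SplitSpatialCodec(nn.Module):
    def __init__(self, channels: int = 2, hidden: int = 64):
        super().__init__()
        self.content_enc = nn.Conv1d(1, hidden, kernel_size=7, padding=3)
        self.spatial_enc = nn.Conv1d(channels, hidden, kernel_size=7, padding=3)
        self.decoder = nn.Conv1d(2 * hidden, channels, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, samples) binaural waveform
        mono = x.mean(dim=1, keepdim=True)
        content = self.content_enc(mono)       # speech content features
        spatial = self.spatial_enc(x)          # spatial-cue features
        return self.decoder(torch.cat([content, spatial], dim=1))

waveform = torch.randn(1, 2, 16000)            # 1 s of stereo audio at 16 kHz
recon = SplitSpatialCodec()(waveform)          # same shape as the input
```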
GWA: A Large High-Quality Acoustic Dataset for Audio Processing
We present the Geometric-Wave Acoustic (GWA) dataset, a large-scale audio
dataset of over 2 million synthetic room impulse responses (IRs) and their
corresponding detailed geometric and simulation configurations. Our dataset
samples acoustic environments from over 6.8K high-quality, diverse, and
professionally designed houses represented as semantically labeled 3D meshes.
We also present a novel real-world acoustic materials assignment scheme based
on semantic matching that uses a sentence transformer model. We compute
high-quality impulse responses corresponding to accurate low-frequency and
high-frequency wave effects by automatically calibrating geometric acoustic
ray-tracing with a finite-difference time-domain wave solver. We demonstrate
the higher accuracy of our IRs by comparing them with recorded IRs from complex
real-world environments. The code and the full dataset will be released at the
time of publication. Moreover, we highlight the benefits of GWA on audio deep
learning tasks such as automatic speech recognition, speech enhancement, and
speech separation: using our dataset yields significant improvements over prior
synthetic IR datasets in all tasks.
Comment: Project webpage: https://gamma.umd.edu/pro/sound/gw
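The semantic material assignment step described above can be sketched as matching each mesh's semantic label to the closest entry in an acoustic material database via sentence-transformer embeddings. The model name and the material and label lists below are illustrative assumptions:

```python
# Match semantic mesh labels to acoustic material names by cosine
# similarity of sentence-transformer embeddings.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

materials = ["carpet", "painted concrete", "wood panel", "glass window"]
mesh_labels = ["living room floor", "kitchen wall", "bookshelf"]

mat_emb = model.encode(materials, convert_to_tensor=True)
label_emb = model.encode(mesh_labels, convert_to_tensor=True)

# Similarity between every label and every material; pick the best match.
scores = util.cos_sim(label_emb, mat_emb)      # (num_labels, num_materials)
for label, row in zip(mesh_labels, scores):
    print(label, "->", materials[int(row.argmax())])
```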
AdVerb: Visually Guided Audio Dereverberation
We present AdVerb, a novel audio-visual dereverberation framework that uses
visual cues in addition to the reverberant sound to estimate clean audio.
Although audio-only dereverberation is a well-studied problem, our approach
incorporates the complementary visual modality to perform audio
dereverberation. Given an image of the environment where the reverberated sound
signal has been recorded, AdVerb employs a novel geometry-aware cross-modal
transformer architecture that captures scene geometry and audio-visual
cross-modal relationships to generate a complex ideal ratio mask which, when
applied to the reverberant audio, predicts the clean sound. The effectiveness of
our method is demonstrated through extensive quantitative and qualitative
evaluations. Our approach significantly outperforms traditional audio-only and
audio-visual baselines on three downstream tasks: speech enhancement, speech
recognition, and speaker verification, with relative improvements in the range
of 18%-82% on the LibriSpeech test-clean set. We also achieve highly
satisfactory RT60 error scores on the AVSpeech dataset.
Comment: Accepted at ICCV 2023. For project page, see https://gamma.umd.edu/researchdirections/speech/adver
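A minimal sketch of applying a complex ideal ratio mask (cIRM), the quantity AdVerb predicts: multiply the reverberant STFT by a complex-valued mask and invert. The random mask below is a placeholder standing in for the network's prediction:

```python
# Apply a complex ideal ratio mask to a reverberant signal in the
# STFT domain and reconstruct the (de-reverberated) waveform.
import torch

n_fft, hop = 512, 128
reverberant = torch.randn(16000)               # 1 s reverberant waveform
window = torch.hann_window(n_fft)

spec = torch.stft(reverberant, n_fft, hop_length=hop,
                  window=window, return_complex=True)   # (freq, frames)

# Placeholder cIRM; in AdVerb this comes from the cross-modal transformer.
mask = torch.complex(torch.randn(spec.shape), torch.randn(spec.shape))
clean_spec = mask * spec                       # complex multiplication

clean = torch.istft(clean_spec, n_fft, hop_length=hop,
                    window=window, length=reverberant.shape[0])
```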
Towards Improved Room Impulse Response Estimation for Speech Recognition
We propose to characterize and improve the performance of blind room impulse
response (RIR) estimation systems in the context of a downstream application
scenario, far-field automatic speech recognition (ASR). We first draw the
connection between improved RIR estimation and improved ASR performance, as a
means of evaluating neural RIR estimators. We then propose a GAN-based
architecture that encodes RIR features from reverberant speech and constructs
an RIR from the encoded features, using a novel energy decay relief loss to
capture energy-based properties of the input reverberant speech.
We show that our model outperforms the state-of-the-art baselines on acoustic
benchmarks (by 72% on the energy decay relief and 22% on an early-reflection
energy metric), as well as in an ASR evaluation task (by 6.9% in word error
rate).
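To illustrate the kind of loss described above, here is a minimal sketch of an energy decay relief (EDR) style loss, under the assumption that EDR is computed as per-frequency Schroeder backward integration of an STFT energy spectrogram; the paper's actual loss may differ in detail:

```python
# EDR-style loss: compare per-frequency energy decay curves (in dB)
# of a predicted and a reference RIR. Assumes equal-length IRs.
import torch

def energy_decay_relief(ir: torch.Tensor, n_fft: int = 256, hop: int = 64):
    spec = torch.stft(ir, n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    energy = spec.abs() ** 2                   # (freq, frames)
    # Schroeder backward integration along time, per frequency band.
    edr = torch.flip(torch.cumsum(torch.flip(energy, dims=[1]), dim=1), dims=[1])
    return 10.0 * torch.log10(edr + 1e-10)     # decay curves in dB

def edr_loss(pred_ir: torch.Tensor, true_ir: torch.Tensor) -> torch.Tensor:
    return torch.mean(torch.abs(energy_decay_relief(pred_ir)
                                - energy_decay_relief(true_ir)))

loss = edr_loss(torch.randn(8000), torch.randn(8000))   # toy usage
```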